Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fallback to bootstrapping #3480

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

yacovm
Copy link
Contributor

@yacovm yacovm commented Oct 17, 2024

Why this should be merged

How this works

How this was tested

yacovm and others added 12 commits October 16, 2024 18:34
The lifecycle of an avalanchego node can be separated to three main stages:

Stage I - If supported by the VM, then a state sync occurs. Otherwise, this stage is skipped.
In the X-chain, this stage is entirely substituted by a custom bootstrapping phase that replicates and executes the X-chain DAG.

Stage II - A stage called bootstrapping, which entails replicating the blocks from other avalanchego nodes, and executing them.

Stage III - The final form of an avalanchego node, in which it runs the snowman consensus protocol.

The phases are implemented by components called "engines":

- avalanche bootstrapping
- snowman
- state syncer
- snowman bootstrapper

And the handler in snow/networking/handler/ is responsible to route messages from the network into the correct engine.
Engines all implement the same common.Engine interface, but the interface consists of the union of all operations across all engines.

Indeed, it is often the case that a message of type `m` dispatched by engine `e`, cannot be processed by a different engine `e'.
For instance, a Chits message cannot be processed by any engine other than consensus, and a message about a query of state summary is only relevant to the state sync engine.

To that end, each engine simply implements a no-op dispatcher for messages it does not care about.

The biggest problem with the existing aforementioned structure, is that the lifecycle of the engines imposes a strict one way movement across the various stages,
and there is no single component which consolidates the transition among the stages. The movement between the various stages takes place by a callback given to each
engine at every stage but the final one (Stage I and Stage II).

The structure therefore makes it difficult to introduce movement from snowman consensus back to bootstrapping / state sync, or to have better control over the message dispatch.

This commit unifies all engines into a single one under snow/engine/unified

As a result, the implementation of the handler in snow/networking/handler/handler.go is now simpler,
as it only interacts with the unified engine, and it never needs to query the snow.EngineState.

The state transition between the various stages is now taken care of by the unified engine, and since the code to dispatch messages to the right engine
is now all in the unified engine, it's not only more testable but it lets us move among the stages in the same place where we consider the stage we're in to dispatch
the message to the correct engine.

Signed-off-by: Yacov Manevich <[email protected]>
This commit adds a mechanism that detects that our node is behind the majority of the stake.
The intent is to later have this mechanism be the trigger for the bootstrapping mechanism.
Currently, the bootstrapping mechanism is only active upon node boot, but not at a later point.

The mechanism works in the following manner:

- It intercepts the snowman engine's Chits message handling, and upon every reception of the Chits message,
  the mechanism that detects if the node is a straggler (a node with a ledger height behind the rest) may be invoked,
  if it wasn't invoked too recently.
- The mechanism draws statistics from the validators known to it, and computes the latest accepted block for each validator.
- The mechanism then proceeds to determine which blocks are pending to be processed (a block pending to be processed was not accepted).
- The mechanism then collects a snapshot of all blocks it hasn't accepted yet, and the amount of stake that has accepted this block.
- The mechanism then waits for its next invocation, in order to see if it has accepted blocks correlated with enough stake.
- If there is too much stake that has accepted blocks by other nodes correlated to it that the node hasn't accepted,
  then the mechanism announces the node is behind, and returns the time period between the two invocations.
- The mechanism sums the total time it has detected the node is behind, until a sampling concludes it is not behind, and then
  the total time is nullified.

Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
@marun
Copy link
Contributor

marun commented Oct 18, 2024

  • If this PR is based on Unify engines #3405, please change the base from master to show only the new commits.
  • Is it really necessary for these PRs to be so big? The scale here is large enough to make diligent review very difficult and the effort required of your reviewers will be considerable.

@yacovm
Copy link
Contributor Author

yacovm commented Oct 18, 2024

  • If this PR is based on Unify engines #3405, please change the base from master to show only the new commits.

Yeah I need to tidy up the commit chain. I didn't mean to post this for review, I forgot to put it in draft mode.

  • Is it really necessary for these PRs to be so big? The scale here is large enough to make diligent review very difficult and the effort required of your reviewers will be considerable.

I can split the PRs to pieces, that's not an issue. I just got the PR you mention working two days ago, so I didn't think it would be reviewed by anyone anytime soon :-)

@yacovm yacovm marked this pull request as draft October 18, 2024 12:35
@yacovm yacovm force-pushed the fallbackToBootstrapping branch 5 times, most recently from 3eb6fb1 to c906bc3 Compare October 18, 2024 20:16
Signed-off-by: Yacov Manevich <[email protected]>
Copy link

This PR has become stale because it has been open for 30 days with no activity. Adding the lifecycle/frozen label will cause this PR to ignore lifecycle events.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants